Use CUDA docker image 11.0.3 tag. #8139

Merged: 7 commits, Aug 7, 2022

trivialfis
Member

No description provided.

@trivialfis
Member Author

Trying to unblock CI.

@trivialfis
Member Author

Close #8114

@trivialfis
Member Author

@hcho3 I think the segfault in the Python tests is caused by a compiler difference: NCCL is probably compiled with a different toolchain, and we statically link it into XGBoost. Here is a gdb backtrace:

Thread 98 "python" received signal SIGBUS, Bus error.
[Switching to Thread 0x7f31adbfe700 (LWP 504172)]
__memset_avx2_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:145
145	../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
(gdb) up
#1  0x00007f3308f0cc03 in ncclShmOpen(char*, int, void**, void**, int) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#2  0x00007f3308f03fd1 in ncclProxyService(void*) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#3  0x00007f338d2c86db in start_thread (arg=0x7f31adbfe700) at pthread_create.c:463
463	pthread_create.c: No such file or directory.

#0  0x00007f338d2d291e in __libc_recv (fd=129, buf=0x7ffe1728e1ee, len=6, flags=64) at ../sysdeps/unix/sysv/linux/recv.c:28
28	../sysdeps/unix/sysv/linux/recv.c: No such file or directory.
(gdb) up
#1  0x00007f3308f0a997 in ncclSocketProgress(int, ncclSocket*, void*, int, int*) [clone .constprop.3] () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#2  0x00007f3308f0c858 in ncclSocketRecv(ncclSocket*, void*, int) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#3  0x00007f3308f06a85 in ncclProxyConnect(ncclComm*, int, int, int, ncclProxyConnector*) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#4  0x00007f3308ee55eb in initTransportsRank(ncclComm*, ncclUniqueId*) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#5  0x00007f3308ee691b in ncclCommInitRankSync(ncclComm**, int, ncclUniqueId, int, int) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#6  0x00007f3308ee73ca in ncclCommInitRankDev(ncclComm**, int, ncclUniqueId, int, int) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#7  0x00007f3308ee77e3 in ncclCommInitRank () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#8  0x00007f3308c1f5b2 in dh::NcclAllReducer::DoInit(int) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so
(gdb) 
#9  0x00007f3308ed5339 in xgboost::tree::GPUHistMaker::InitDataOnce(xgboost::DMatrix*) () from /home/jiaming/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so

@trivialfis
Member Author

It's caused by NCCL failing to allocate shared memory: the SIGBUS in the backtrace above is raised by memset inside ncclShmOpen.
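
A quick way to test this diagnosis on a machine that reproduces the crash is to look at how much shared memory is actually free. The sketch below assumes NCCL backs its shared-memory segments with files under /dev/shm, which is the usual location on Linux but is not stated in this thread. The helper names and the 64 MiB threshold are illustrative only, not NCCL requirements.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return the free bytes on the filesystem backing `path`, or None if `path` is missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def has_enough_shm(required_bytes, path="/dev/shm"):
    """Hypothetical helper: True if `path` exists and has at least `required_bytes` free."""
    free = shm_free_bytes(path)
    return free is not None and free >= required_bytes


if __name__ == "__main__":
    # 64 MiB is an arbitrary illustrative threshold, not a documented NCCL requirement.
    print("free bytes in /dev/shm:", shm_free_bytes())
    print("enough for 64 MiB:", has_enough_shm(64 * 1024 * 1024))
```

If the reported free space is small, for example inside a container with a limited shared-memory mount, that would be consistent with the bus error above. It does not prove that this is what happened here.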

@trivialfis
Member Author

I will add a better check and open an issue on NCCL as follow-ups. Please review this PR to unblock the CI.
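
The thread does not say what the better check will look like. One possible shape, sketched here purely as an assumption, is to reserve the backing storage up front so that a shortage surfaces as an ordinary error rather than a SIGBUS on first write. On Linux, growing a file with ftruncate only sets its size, while posix_fallocate asks the filesystem to reserve the blocks and fails if it cannot. `reserve_file` is a hypothetical helper, not part of XGBoost or NCCL.

```python
import os
import tempfile


def reserve_file(path, size):
    """Create `path` with `size` bytes reserved; raise OSError if the space is unavailable."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Unlike ftruncate, this asks the filesystem to allocate the blocks now.
        os.posix_fallocate(fd, 0, size)
    except OSError:
        # Clean up the partially created file before reporting the failure.
        os.close(fd)
        os.unlink(path)
        raise
    return fd


if __name__ == "__main__":
    # Demonstrate on a temporary directory; a real check would target the shared-memory mount.
    with tempfile.TemporaryDirectory() as tmp:
        fd = reserve_file(os.path.join(tmp, "segment"), 1 << 20)
        print("reserved bytes:", os.fstat(fd).st_size)
        os.close(fd)
```

The caller can then catch OSError and report a readable message instead of letting the process die in memset.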

@trivialfis trivialfis merged commit bcc8679 into dmlc:master Aug 7, 2022
@trivialfis trivialfis deleted the cuda-11.0.3 branch August 7, 2022 08:32
trivialfis added a commit to trivialfis/xgboost that referenced this pull request Aug 12, 2022
trivialfis added a commit that referenced this pull request Aug 12, 2022
* Update CUDA docker image and NCCL. (#8139)

* Rest of the CI.

* CPU test dependencies.